Introduction

With future climate change droughts in the Amazon forest may become more frequent and/or severe. Droughts can turn Amazon regions from rain forest into savanna, leading to high amounts of carbon released into the atmosphere. Therefore, predicting future droughts and understanding the underlying mechanisms is of great interest. Ciemer et al. (2020), established an early warning indicator for droughts in the central Amazon basin (CAB), based on tropical Atlantic sea surface temperatures (SSTs). Inspired by their work, the aim of this thesis is to build up on this work and improve its predictive power by using different statistical methods. Here we seek to build a model that is able to predict monthly precipitation based on the sea surface temperatures. Also we want to identify those sea regions that are most important for doing so, making interpretability a point of interest, too. Firstly we will analyze the data descriptively to explore patterns and spatial dependencies. This includes a cluster analysis of the precipitation data in the central Amazon basin. Following we will compare two different regression approaches and their capability to predict precipitation as well as their interpretability of the SST regions selected by them. The first model is the lasso as proposed by R. Tibshirani (1996). Comparing different model specifications we will carry on the findings from the lasso and fit a (sparse) fused lasso on the data (R. Tibshirani et al. (2005)). Both models will be evaluated using a 5-fold forward selection, a model evaluation technique that takes into account the time dependencies present in the data at hand. We conclude with a summary of the findings in this work and give an overview of strengths and limitations of the approaches used together with ideas for future research.

This thesis was written and supervised in cooperation with Dr. Niklas Boers from the Potsdam Institute for Climate Impact Research (Climate Impact Research (PIK) e. V. (2022)) and Dr. Fabian Scheipl (LMU)

2 EDA

In this chapter we will inspect the values of the precipitation and SST data for the common observation period from 1981 until 2016.

2.1 EDA precipitation

In this section we want to study the time series of precipitation in the Central Amazon Basin. The CHIRPS data set contains the precipitation data, created from in-situ and satellite measurements (Funk et al. (2015)). It can be downloaded for example from here. It contains observations from 1981 to 2016 and comes on a high resolution of 0.05 grid, which we aggregate to a 0.5 grid.

Below the area of the Central Amazon Basin that is object of our study.

Location of the area under study. The central amazon basin (CAB) spanning across 0,-10 latitude and -70,-55 longitude

Figure 2.1: Location of the area under study. The central amazon basin (CAB) spanning across 0,-10 latitude and -70,-55 longitude

We will now inspect the precipitation values from three perspectives. Firstly the raw values without time or spatial dependency, second the mean and standard deviation for each spatial grid cell for the whole time series and then the mean and standard deviation for each grid cell but for each month of the year separately.

Firstly we plot precipitation values in general
Density of the raw precipitation values without time or spatial dependency.

(#fig:precip dens)Density of the raw precipitation values without time or spatial dependency.

Its form is a uni-modal, right-skewed density. The values range from 0 up to 769, but only few observations take these high values, forming a large tail. This might be a indication for large outliers in the data or due to some locations with very high precipitation values in general.

Mean and standard deviation at each location. The standard deviation was computed over the whole time period. The white line on the scale at the side of the plots indicates the mean of the respective quantity

Figure 2.2: Mean and standard deviation at each location. The standard deviation was computed over the whole time period. The white line on the scale at the side of the plots indicates the mean of the respective quantity

As we can see most locations have a mean precipitation of around 200 mm/month, over the whole time series. Regionally in the “upper left” corner of the Amazon Basin, mean precipitation is higher or equal to the mean. The reference point for “higher” is the mean of the location means. This region seems to be more or less spatially consistent. The rest of the region with lower mean precipitation has also some small areas where precipitation is again a little bit higher. For example in the upper right corner and on the bottom, right of the middle.

For the standard deviation we also see regional patterns. These patterns overlap with the regions of the mean but their magnitude is flipped. Meaning, in the upper left where we observe larger mean values we generally observe lower standard deviation and in the lower and upper right corners, higher standard deviations.

separately
Mean precipitation values at each location, shown for the different months of a year separately. The white line on the scale at the side of the plots indicates the mean of the respective quantity

Figure 2.3: Mean precipitation values at each location, shown for the different months of a year separately. The white line on the scale at the side of the plots indicates the mean of the respective quantity

We see spatial patterns of the mean evolving over time. For example: From May until August there is a spatial separation in two parts that dissolves in September. As expected there is a large seasonal component regarding the means.

For the standard deviation we see as well large differences in values during different months of the year.

2.2 Glyph plots

This section provides a graphical presentation of the precipitation data known as glyph plots. The idea of glyph maps, its application and general implementation that were used in this section are taken from @:wickham2012glyph. Glyph maps use a small icon or glyph to show multiple values at each location. In our case, we show a complete time series at each location instead of just single values. Different techniques can then be used compare the time series between all locations or their individual shape on a local scale. We will show seasonal, de-seasonalized, and de-seasonalized data on a local scale. Seasonal time series are computed by computing the averages of each month on each location. Each seasonal time series therefore has only 12 values and can be plotted without smoothing. The de-seasonalized time series are computed by omitting the seasonal effects on each time series for the complete observation period and therefore has to be smoothed to be visually inspectable. The de-seasonalized time series then can be used to compare the time series for each location on a common or local scale. On the common scale all values are displayed on the same axis range, while on the local scale the axis are changed so that their ranges refer to the range on the respective location. Rescaling is done as follows

\[\begin{equation} x_{rescaled} = \frac{x - \min(x)}{\max(x) - \min(x)}. \tag{2.1} \end{equation}\]

This will help us to see the changes in value at each location relative to the range of the values at the same location. But this also means that interpreting these plots has to be done carefully because, in this form of display, large difference might actually refer to only small changes in absolute values. It can be due to the small range of values at that location in general, that these changes seem to be large. To aid the interpretation of these plots we can use color shadings to draw attention to areas in which ranges are large, meaning larger differences in their relative values also point to larger differences in their absolute values (i.e unscaled values, values on the global scale). Therefore locations with large ranges are shaded in lighter colors and smaller ranges are shaded in dark color, to make the lighter shaded areas more easily visible.

To improve readability of glyph maps, one can also add boxed for each glyph as well as reference lines for global means. This way the trajectory of the glyphs can be viewed in comparison to the overall mean directly.

Glyph map of seasonal precipitaton pattern. Each location is presented by a time series. The time series are seperated by boxes. The gray reference lines inside the boxes show the mid-range for easier comparison.

Figure 2.4: Glyph map of seasonal precipitaton pattern. Each location is presented by a time series. The time series are seperated by boxes. The gray reference lines inside the boxes show the mid-range for easier comparison.

The above figure is a glyph-map of seasonal precipitation patterns (averages for each month) in the Central Amazon Basin. The gray reference lines show the mid-range for easier comparison of the patterns. We see differences in the seasonal patterns across the map. In the upper left for example, the seasonal patterns stay above mid-range while on the bottom-left they have values clearly towards the low end of the range. Also some areas have multimodal patterns. The patterns differ in range and month of maximum and minimum precipitation.

Glyph map of de-seasonalised and smoothed precipitation. Each location is presented by a time series. The time series are seperated by boxes. The gray reference lines inside the boxes show the mid-range for easier comparison. The time series are scaled globally, same positions inside the cells correspond to the same values in all locations.

Figure 2.5: Glyph map of de-seasonalised and smoothed precipitation. Each location is presented by a time series. The time series are seperated by boxes. The gray reference lines inside the boxes show the mid-range for easier comparison. The time series are scaled globally, same positions inside the cells correspond to the same values in all locations.

This plot shows the smoothed de-seasonalized monthly precipitation, after global scaling. The same position within each cell corresponds to the same value in all locations. Some areas have almost a linear course, increasing, decreasing or constant. Others show a more “wiggly” courses. As overall pattern we can see that the forms of the patterns have a spatial connection, patterns are close to similar patterns, at the same latitude. Also regarding latitude the closer to the equator the less precipitation.

Glyph map of de-seasonalised and smoothed precipitation. The time series are scaled locally, ranges are not the same in all cells. The different ranges are given in color shades, where lighter shading indicates a larger range and darker shades smaller ranges.

Figure 2.6: Glyph map of de-seasonalised and smoothed precipitation. The time series are scaled locally, ranges are not the same in all cells. The different ranges are given in color shades, where lighter shading indicates a larger range and darker shades smaller ranges.

Now we inspect the glyph-map with de-seasonalized locally scaled values. This form of scaling emphasizes the individual shapes. Because of the applied scaling, big patterns may be just be tiny effects. Therefore colors are added according to range. Areas with lighter color have larger ranges than darker areas. The areas with steep linear increases and decreases have smaller ranges than or example the areas below -2.5 latitude in the left.

The results of the precipitation glyphs indicate that the CAB might be separable in different regions. If we can find a way to quantify the differences in these regions and separate them into clusters, we could then apply our regression models to each of these clusters and eventually improve model accuracy on each region as compared to the complete are on average. Therefore in a later section we will discuss and apply clustering algorithms to the precipitation data. But for now we will have a look at the SST data.

2.3 EDA SST

We explore the sea surface temperature data set, used in the paper by Ciemer et al (Ciemer et al. (2020)). ERSST (Extended Reconstructed Sea Surface Temperature, Huang et al. (2017)) is a reanalysis from observed data given in the International Comprehensive Ocean-Athmosphere Data Set (ICOADS). Which contains observations from 1800 until 2016, made by ships and buoys for example. The data comes on a 2x2 degree grid, where data was missing interpolation techniques were used. See paper for reference. the file contains two variables that are measured across different dimensions. The two variables contain the sea surface temperatures and the respective SST anomalies (with respect to the 1971-2000 monthly climatology). Here we analyze the raw SST values, since the anomalies are computed to a climatology that spans a time frame we will use in the analysis. Using the anomalies therefore would eventually introduce information leakage because during the training process future values were used for fitting the model.

Mean and SD on the spatial map.

Figure 2.7: Mean and SD on the spatial map.

We see spatial patterns of the mean evolving over time. For example: From May until August there is a spatial separation in two parts that dissolves in september. As expected there is a large seasonal component regarding the means.

For the standard deviation we see as well large differences in values during different months of the year.

3 Correlation analysis

We give a short overview over the correlation between monthly sea surface temperature and monthly mean precipitation in the Central Amazon Basin (CAB). First we will analyse the original and then the de-seasonalized data. SST and precipitation data have been de-seasonalized, meaning first each time series was decomposed by the stl algorithm according to \[Monthly \textit{ } Data = Seasonal + Trend + Remainder\] Afterwards only trends and remainders time series were kept to constitute a new time series that will be used as predictor (SST) and target (precipitation).

In a next step we compute the pearson correlation coefficient \(\rho\) between each SST grid point time series and the mean precipitation time series.

\[\begin{equation} \rho_{X,Y}=\frac{\mathbb{E}\left[\left(X-\mu_{X}\right)\left(Y-\mu_{Y}\right)\right]}{\sigma_X\sigma_Y} \tag{3.1} \end{equation}\]

Since our goal is to predict the precipitation on the SST information, we are also interested in the correlation of the SST and future precipitation some months ahead. To examine this we also compute the correlations for different time lags. For example for a time lag of 6 month we correlate January SST data and precipitation in July.

We consider time lags of 0,3,6 and 12 months. And show the density of the correlation values as well as their spatial distribution on a map. We also display the highest positive and negative correlation based on their respective 2.5% and 97.5% quantiles. All correlations that are between these values are set to 0 then.

3.1 Correlation of Sea Surface Temperature and Precipitation

3.1.1 Original Data

Following, for each timelag we show the respective density of correlation values, their location on the map and also the 5% strongest positive and negative correlations.

3.1.1.1 Timelag 0

Inspecting the density plot for time lag 0, we see two modi for correlations, one for negative correlations around -0.8 and one for positive correlations around 0.8. Also a small spike can be seen for low negative correlations. If we plot these correlations on the respective grid points we see a clear north-south negative-positive correlation distinction. The “boarder” is organised around the equator. The plot for the strongest 5% of correlations reveals areas with strong positive and negative correlations in the north and south respectively.

3.1.1.2 Timelag 3

The density of correlations for timelag 3, is left-skewed and has two modi that are organised around 0 and -0.125 respectively. The correlation map shows that the high positive and negative correlations are more close to equator here. Note that the legend for the correlationmap is “shifted” here, because the maximal negative correlation has a higher absolute value than the maximal positive correlation. The strongest correlations also seem to be shifted towards the equator.

3.1.1.3 Timelag 6

We can see the density plot for timelag 6 is pretty similar to the one of timelag 0 but seems to be “flipped” around 0. Similarly the correlation map shows (high) negative correlations in the south now and high postive correlations in the north.

3.1.1.4 Timelag 12

Giving a time lag of one year, we can see that the distribution of correlations is now again similar to the distribution for time lag 0. This also hold for the location of positive and negative correlations in general, as well as for the strongest 5% of correlations.

3.1.2 De-seasonalized Data

Following, for each time lag we show the respective density of correlation values, their location on the map and also the 5% strongest positive and negative correlations.

3.1.2.1 Timelag 0

Inspecting the density plot for timelag 0, we see that after excluding seasonality from the time series we get a left-skewed distribution of correlations. With a mode around 0. In general the correlation values are a lot lower than in the original data. With a maximum at around -0.4 and +2.5 respectively. We plot these correlations on the respective grid and see that the clear north south distinction in the correlations before de-seasonalizing the data does not appear anymore. The plot for the strongest 5% of correlations reveals areas with strongest positive and negative correlations. But as stated before the values are in general much lower.

3.1.2.2 Timelag 3

For the density of timelag 3 we get a similar picture as for timelag 0. The mode is a higher and the tails get a bit more mass. Also the correlation map does not seem to change a lot. The strongest correlations appear to be shifted to the left.

3.1.2.3 Timelag 6

Since the distributions of correlation values are all unimodal we do not observe the “flip” we saw in the original data when comparing the densities of timelag 0 and 6. The mode again gets larger and the maximum postive and negative correlation values get smaller. The strongest negative correlations are shifted further to left.

3.1.2.4 Timelag 12

Given a timelag of one year, the distribution now has a mode around 6, and started at around 4 when timelag was 0. Also neither the positive nor negative correlations exceed values of 2.5. The consistent region of strong negative and positive correlations is now less organised or more scattered.

3.2 Summary

3.2.1 Original Data

We can observe that the positive and negative correlations of sst and precipitation follow a spatial and temporal pattern. The location and density of the positive and negative correlation “wanders” over the equator in opposite directions. The densities and correlationmaps for timelag 0 and 6 appear to be quite similar but “flipped”. The densities and correlationmaps for timelag 0 and 12 appear again to be similar. The same pattern seems to hold for the strongest correlations.

3.2.2 De-seasonalized Data

Correlation values are in general a lot lower than in the original data and decrease with increasing timelag. We still observe temporal and regional patterns, although these dissolve a bit for a timelag of 12.

4 Clustering

In this chapter we will first summarize the main ideas of clustering and then apply it to the precipitation data. If not indicated otherwise the information is taken from Elements of Statistical Learning.

4.1 Main Idea Clustering

We can describe an object by a set of measurements or its similarity to other objects. Using this similarity we can put a collection of objects into subgroups or clusters. The objects in the subgroups should then be more similar to one another than to objects of different subgroups. This means inside the clusters we aim for homogeneity and for observations of different clusters for heterogeneity. With the clustering analysis applied to the precipitation data we want to study if there are distinct groups (regions) apparent in the CAB. So that if we later apply the regression models we predict the precipitation for each group and not for the whole region.

To explore the grouping in the data we need a measure of (dis)similarity. This measure is central and depends on subject matter considerations. We construct the dissimilarities based on the measurements taken for each month. We interpret this as a multivariate analysis where, each month is one variable. So given the area in the CAB (resolution 5°x5°), we have 612 cells and 432 months, resulting in a \(612 \times 432\) data matrix. we want to cluster cells into homogen groups.

4.2 Clustering Methods

4.2.1 \(k\)-means

In the following we briefly describe the \(k\)-means procedure. Beforehand we have to specify a number of clusters \(C\) we believe exist in our data. Then we randomly initialize \(C\) cluster centers in the feature space. Now two steps are repeated until convergence:

  1. For each center we identify the points that are the closest to this center.
    These points “belong” now to a cluster \(C\).
  2. In each cluster we compute the mean of each variable and get a vector of means. This mean vector is now the new center of the cluster.

As a measure of dissimilarity we use the Euclidean distance:

\[\begin{equation} d(x_i,x_{i´}) = \sum_{j=1}^p(x_{ij}-x_{i´j})^2=||x_i-x_{i´}||^2 \tag{4.1} \end{equation}\]

Meaning for the points \(i\) and \(i´\) we compute the squared difference for each variable and sum them up. As stated above we are searching for clusters that are themselves compact, meaning homogeneous. We do so by minimizing the mean scatter inside the clusters. We summarize this scatter as within-sum-of-squares

\[\begin{equation} W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(i´)=k} ||x_i-x_{i´}||^2 = \sum_{k=1}^K N_k \sum_{C(i)=k} ||x_i-\bar{x}_k ||^2 \tag{4.2} \end{equation}\]

where \(\bar{x} = (\bar{x}_{1k},...,\bar{x}_{pk})\) stands for the mean vectors of the \(k\)th cluster and \(N_k = \sum_{i=1}^N I(C(i)=k)\).

4.2.2 \(k\)-means characteristics

variance of each distribution of each attribute (variable) is spherical, variance is symmetrical? all variables have same variance, not the case in our example, therefore scaling or pca equal number of observations in each clusters, we don`t know

Since we use the Euclidean distance the similarity measures will be sensitive to outliers and scale. \(k\)-means assumes that the variance of a variable´s distribution is spherical, meaning it wight not work well in situations that violate this assumptions (f.e non-spherical data). Further assumptions are same variance of the variables, and equally sized clusters. Now how “large” these violations have to be so that \(k\)-means does not work well anymore has no clear-cut answer.

4.2.3 K-medoids

We can adjust \(k\)-means procedure so that we can use other distances than the Euclidean distance. The only part of the \(k\)-means algorithm that uses Euclidean distance is when we compute the cluster centers. We can replace this step by formalizing an optimization with respect to the cluster members. For example so that each center has to be one of the observations assigned to the cluster. K-medoids is far more computationally intensive than \(k\)-means.

4.2.3.1 K-medoids characteristics

K-medoids is less sensitive to outliers, because it minimizes sum of pairwise dissimilarities instead of sum of squared Euclidean distances. As stated above it is also more computationally intensive.

4.2.4 PCA

Goal is reduction of correlated and eventually large number p variables to a few. We accomplish this by creating new variables that are linear combinations of the original ones. We call these new variables principle components. The new variables are not correlated any more and ordered according to the variance they explain. The first \(k < p\) principal components then contain the majority of variance (Fahrmeir et al. (1996)). As they are ordered they also provide a sequence of best linear approximations of our data. Let \(x_1, x_2,...,x_N\), be our observations and we represent them by a rank-q linear model

\[\begin{equation} f(\lambda) = \mu + V_q\lambda \tag{4.3} \end{equation}\]

with \(\mu\) a location vector in \(\mathbb{R}^p\) and \(V_q\) is a \(p \times q\) matrix with columns being orthogonal unit vectors as columns. \(\lambda\) is a \(q\) vector of parameters. In other words we are trying to fit a hyperplane of rank q to the data. If we fit this model to minimize reconstruction error using least squares we solve

\[\begin{equation} \min_{\mu, \{\lambda_i\}, V_q } \sum_{i=1}^N||x_i - \mu - V_q\lambda_i ||^2 \tag{4.4} \end{equation}\]

If we partially optimize for \(\mu\) and \(\lambda_i\) we obtain

\[ \hat{\mu} = \bar{x},\] \[\hat{\lambda_i} = V_q^T(x_i - \bar{x}) \] Therefore we need to search for the orthogonal matrix \(V_q\)

\[\begin{equation} \min_{V_q} \sum_{i=1}^N ||(x_i-\bar{x}) - V_qV_q^t(x_i-\bar{x})||^2 \tag{4.5} \end{equation}\]

We can assume here that \(\bar{x}\) is 0, if this not the case we can simply center the observations \(\tilde{x}_i = x_i - \bar{x}\). \(H_q = V_qV_q^T\) projects each observation \(x_i\) from the original feature space onto the subspace that is spanned by the columns of \(V_q\). \(H_qx_i\) is then the orthogonal projection of \(x_i\) Hence \(H_q\) is also called the projection matrix.

We find \(V_q\) then by constructing the singular value decomposition of our data matrix X. \(X\) contains the (centered) observations in rows, giving a \(N \times p\) matrix. The SVD is then:

\[\begin{equation} X = UDV^T \tag{4.6} \end{equation}\]

Where \(U\) is an orthogonal matrix containing the left singular vectors \(u_j\) as columns, and \(V\) contains the right singular vectors \(v_j\). The columns of \(U\) span the columns space of \(X\) and the columns of \(V\) span the row space. \(D\) is diagonal matrix which contains singular values, \(d_1 \leq d_2 \leq ... \leq d_p \leq 0\). Singular values are square roots of non-negative eigenvalues. The columns of \(UD\) are the principal components of \(X\). So the SVD gives us the matrix \(V\) (the first \(q\) columns give the solution to the minimization problem above) as well as the principal components from \(UD\) (Hastie et al. (2009)).

4.2.5 Gap statistic

The idea of the gap statistic was introduced by R. Tibshirani, Walther, and Hastie (2001) As stated above we usually measure how compact our clusters are by assessing \(W(C)\) or \(log(W_c)\). Where low values indicate compact clusters. To compare the value then, we need a reference. We therefore want to estimate how large \(W_c\) were if there were no clusters present in our data. The larger the difference between the \(W_c\) from the data and the one from the reference the more likely we are to say that the found number of clusters is indeed correct. We construct reference data by sampling from a uniform distribution based on our data. Say we have \(p\) variables. We sample \(n\) times from each of the \(p\) uniformly distributed variables, where maximum and minimum are obtained from our data. We then cluster the reference data in the same we cluster our observed data and compute \(W_c\). We repeat this process several times and compute the average of \(W_c\), \(E \{log(W_c)\}\). The gap statistic is then the difference

\[\begin{equation} E \{log(W_c) \} - log(W_c). \tag{4.7} \end{equation}\]

So in cases where our data is formed of clusters we would expect a high gap statistic. As R. Tibshirani, Walther, and Hastie (2001) note and as it is also done in the used R function clusgap, doing a PCA on the data and compute the gap statistic on the PCA scores can improves the results of gap statistic.

4.3 Clustering results

We summarize and compare the results we obtain if we center or don’t center the data.

4.3.1 \(k\)-means and PAM gap statistics without PCA

Gap statistics for different number of clusters k, for the k-means (left) and the PAM (right) algorithm respectively. The dashed line indicates the optimal number of clusters found.

Figure 4.1: Gap statistics for different number of clusters k, for the k-means (left) and the PAM (right) algorithm respectively. The dashed line indicates the optimal number of clusters found.

We can see that on the original data, neither \(k\)-means nor PAM find an optimal number of clusters that is lower than the maximum number specified.

4.3.1.1 Scree plot

Scree plot with the principal components on the x-axis and the respective percentage of variance explained on the y-axis. The PCA was done after centering the data.The first three and four PC`s together explain 67.77 and 70.69\% of the variance respectively

Figure 4.2: Scree plot with the principal components on the x-axis and the respective percentage of variance explained on the y-axis. The PCA was done after centering the data.The first three and four PC`s together explain 67.77 and 70.69% of the variance respectively

We apply a PCA on the centered data. The variance that is explained by the principal components indicates that, the 3 principal components already explain a lot of the appearing variance in the data. We now study the gap statistic results for \(k\)-means and K-medoids (here, meaning PAM) after applying the PCA and choosing the number of principal components to be used.

Following we compare the results of a gap statistic when the first 3 and 4 principal components are used, for \(k\)-means and PAM respectively. The first three and four PC`s used explain 67.77 and 70.69% of the variance respectively.

4.3.1.2 \(k\)-means and PAM gap statistics after applying PCA

Gap statistic plots for k-means and PAM when using 3 and 4 PC's respectively.

Figure 4.3: Gap statistic plots for k-means and PAM when using 3 and 4 PC’s respectively.

The graphic shows the gap statistics resulting from \(k\)-means and PAM for 3 and 4 principal components respectively. When 3 principal components are used \(k\)-means and PAM find 5 and 12 as optimal number of clusters, respectively. When we choose to use the first 4 principal components, both algorithms choose 13 as the optimal number.

4.3.2 Summary

We used the gap statistic as evaluation tool to find the optimal number of clusters in our data.
It was applied by using \(k\)-means and PAM as cluster algorithms, on the centered data with and without applying a pca beforehand. When applied without PCA, PAM and \(k\)-means found the maximal number of clusters optimal, regardless of centering.
The results differed regarding centering the data when a PCA was applied.
When the data was centered before the PCA, we find 5 and 12 (3 principal components, \(k\)-means and PAM, respectively) and 13 (4 principal components, both \(k\)-means and PAM) as optimal number of clusters. The results overall vary greatly and are sensitive to choices such as number of principal components, the algorithm and the decision criterion (here R. Tibshirani, Walther, and Hastie (2001)). The value of the clustering itself can be evaluated only when used in combination with regression.
If useful, fitting different models for the different clusters should result in a lower overall average prediction error. We will proceed by using \(k\)-means for finding 5 clusters after applying a PCA on the centered data and use 3 principal components.

4.4 Analyse clustering results

We now analyze further the results we found. Namely the clusters we find after performing a PCA on the data, using the 3 first principal components and searching for 5 clusters using \(k\)-means. The first 3 principal components explain 67.8% of the variance We found 5 as optimal number of clusters based on the gap statistic and the criteria from Tibsherani.

We show the map plot for the \(k\)-means clustering after applying the PCA as well as the time series of the resulting clusters.

Spatial distribution of the found clusters in the CAB. We applied a centered PCA on the data and used 3 principal components before applying the k-means algorithm

Figure 4.4: Spatial distribution of the found clusters in the CAB. We applied a centered PCA on the data and used 3 principal components before applying the k-means algorithm

The plot above shows the grid cells in the Central Amazon Basin colored according to the clusters that are assigned by \(k\)-means. The clusters are almost completely spatially coherent. Meaning that the clusters are not scattered across different areas. One exception can be seen for Cluster 1 and 4. Parts of cluster 1 (orange) are inside cluster 4 (blue) and on the edge to cluster 3 (green).

Plots of the time series in the clusters we found using the k-means algorithm. The x-axis shows the month of measurement, the y-axis the centered precipitation. Centering was done according to the overall CAB mean in each month. The mean inside each cluster for a month is displayed in blue.

Figure 4.5: Plots of the time series in the clusters we found using the k-means algorithm. The x-axis shows the month of measurement, the y-axis the centered precipitation. Centering was done according to the overall CAB mean in each month. The mean inside each cluster for a month is displayed in blue.

We now inspect the original (centered) time series inside the clusters.
The time series are shown in gray and the monthly mean in the cluster is shown in blue. Since the time series are centered before applying the PCA and clustering, the zero value is the mean of the respective month of the whole CAB.
The clusters differ in their monthly differences from the monthly CAB mean (here 0 because the time series were centered before PCA and \(k\)-means) and their fluctuation/ variance. Also the size of the clusters are not all the same.
The mean in cluster 3 has lowest variability around the CAB mean, followed by clusters 2 and 5, and clusters 1 and 4 have the highest variability.
On average cluster 3 is on the level of the CAB overall mean. The clusters 2 and 5 are slightly below and cluster 4 is above the CAB mean, on average. Cluster 1 is on average also on the CAB mean but shows more variance than cluster 3.

5 LASSO Regression

5.1 Introduction

We want to create a model that has predictive and explanatory power. Predictive power meaning it can predict the precipitation in the Central Amazon Basin, “reasonably well”. Explanatory power in the sense of being interpretable, so that we can identify those regions in sea that give us most information about future precipitation. Our problem setting is high dimensional with n << p. The number of predictors is a lot bigger than the number of actual observations. This creates issues with a classic linear model since the linear problem is underdetermined. One possible model for the problem at hand is a LASSO regression model.

In general for the linear model:

\[\begin{equation} y_i = \beta_0 + \pmb{x}_i^T\pmb{\beta} + \epsilon_i \tag{5.1} \end{equation}\]

see (5.1) Where the \(y_i\) refers to the mean precipitation for a given month \(i\) and \(x_i\) is the vector of sea surface temperature at different locations around the globe. \(\epsilon_i\) is the residual and we wish to estimate the \(\beta\)´s from the data. As already stated this is not possible with a classic linear model since the number of predictors exceeds the number of observations. We therefore can not estimate a \(\beta\) for every grid point in the sea. From a physical point of view it also seems reasonable that some regions in the ocean have a higher predictive power than others. For example regions that are closer to the Amazon may have more influence on precipitation in the same month. But regions more far away may have more information on the precipitation half a year in the future. We therefore would like to use a model that can find the most important regions in the sea for predicting precipitation for some point in the future. One possible solution for this is a LASSO regression model, as implemented in R by the glmnet package (Friedman, Hastie, and Tibshirani (2010)). This model “automatically” performs model selection, but be aware that because of the time dependencies in our data, normal Cross Validation methods may be unjustified or at least have to be applied with caution. The glmnet package implements the regression problem in the following manner, solving:

\[\begin{equation} \min_{\beta_0, \beta} \frac{1}{N} \sum_{i=1}^N w_il(y_i,\beta_0 + \beta^Tx_i) + \lambda [(1-\alpha)||\beta||_2^2/2 + \alpha||\beta||_1] \tag{5.2} \end{equation}\]

This is a lasso regression for \(\alpha = 1\) and ridge regression for \(\alpha = 0\), \(\alpha\) controls the overall strength of regularization or penalty. Intuitively this means we try to find those \(\beta\)´s that minimize the negative log likelihood of our data (this is equal to maximizing the log-likelihood). But at the same time we can not include too many \(\beta\) since this will make the second and third term in the formula grow. As result the algorithm chooses only those predictors that have the most predictive power. How many predictors are included depends on the strength of regularization given by \(\alpha\). Remark: Among strongly correlated predictors only one is chosen in the classical lasso model. Ridge regression shrinks the coefficients to zero. Elastic net with \(\alpha=0.5\) tends to either include or drop the entire group together. To specifically choose a group of predictors, variations of the lasso or other models have to be considered.

5.2 Implementation

The glmnet function finds a solution path for the lasso problem via coordinate descent. The implemented algorithm was suggested by Van der Kooij (2007). We can write down the optimization procedure as follows: Given \(N\) observation pairs \((x_i, y_i)\) with \(Y \in \mathbb{R}\) and \(X \in \mathbb{R}^p\), we approximate the regression function with \(E(Y|X = x) = \beta_0 + x^T\beta\), Here \(x_{ij}\) are considered standardized, so \(\sum_{i=1}^N=0\),\(\frac{1}{N}\sum_{i=1}^Nx_{ij}^2=1\) for \(j = 1,...,p\). We then solve the problem:

\[\begin{equation} \min_{(\beta_0, \beta)\in\mathbb{R}^{p+1}} \big{[} \frac{1}{2N} \sum_{i=1}^N (y_i - \beta_0 - x_i^T\beta)^2 + \lambda [(1-\alpha)||\beta||_2^2/2 + \alpha||\beta||_1 \big{]} \tag{5.3} \end{equation}\]

Note that this solves the elastic net problem that also uses a ridge penalty. We follow the elastic net description but in our case \(\alpha = 1\), using only the lasso penalty. We consider now a coordinate descent step for solving (5.3). Given we have estimates \(\tilde{\beta_0}\) and \(\tilde{\beta_l}\) and we want to partially optimize with respect to \(\beta_j\), and \(i \neq j\). When \(\beta_j > 0\),

\[\begin{equation} \left.\frac{\partial R_{\lambda}}{\partial \beta_{j}}\right|_{\beta=\tilde{\beta}}=-\frac{1}{N} \sum_{i=1}^{N} x_{i j}\left(y_{i}-\tilde{\beta}_{0}-x_{i}^{\top} \tilde{\beta}\right)+\lambda(1-\alpha) \beta_{j}+\lambda \alpha . \tag{5.4} \end{equation}\]

And similar expressions exist for \(\tilde{\beta}_{j}<0\). \(\tilde{\beta}_{j}=0\) is treated separately. The coordinate-wise update has then the form:

\[\begin{equation} \tilde{\beta}_{j} \leftarrow \frac{S\left(\frac{1}{N} \sum_{i=1}^{N} x_{i j}\left(y_{i}-\tilde{y}_{i}^{(j)}\right), \lambda \alpha\right)}{1+\lambda(1-\alpha)} . \tag{5.5} \end{equation}\]

with

  • \(\tilde{y}_{i}^{(j)}=\tilde{\beta}_{0}+\sum_{\ell \neq j} x_{i \ell} \tilde{\beta}_{\ell}\) standing for fitted value without the contribution from \(x_{i j}\), and therefore \(y_{i}-\tilde{y}_{i}^{(j)}\) is the partial residual when fitting \(\beta_{j}\). Because we applied a standardization, \(\frac{1}{N} \sum_{i=1}^{N} x_{i j}\left(y_{i}-\tilde{y}_{i}^{(j)}\right)\) denotes the simple least-squares coefficient for fitting this partial residual to \(x_{i j}\).

  • \(S(z, \gamma)\) being the soft-thresholding operator. It’s value is given by:

\[\begin{equation} \operatorname{sign}(z)(|z|-\gamma)_{+}= \begin{cases}z-\gamma & \text { if } z>0 \text { and } \gamma<|z| \\ z+\gamma & \text { if } z<0 \text { and } \gamma<|z| \\ 0 & \text { if } \gamma \geq|z| .\end{cases} \tag{5.6} \end{equation}\]

So in summary the steps are as follows: Compute the simple least-squares coefficient on the partial residual, then apply soft-thresholding and proportional shrinkage for the lasso and ridge penalty, respectively. Again for our use case, since we use the lasso and \(\alpha = 1\), we only apply soft-thresholding and no proportional shrinkage.

The solutions are computed starting from smallest \(\lambda_{max}\) for which all elements in \(\hat{\beta}=0\). For all larger \(\lambda\) the coefficients then stay 0. The smallest \(\lambda\) value \(\lambda_{min}\) is then selected by \(\lambda_{min}=\epsilon \lambda_{max}\). The complete searched vector is constructed as sequence of K values, typical values are \(\epsilon = 0.001\) and \(K = 100\). This procedure is an example of so called warm starts. By default they always center the predictor variable. For additional information on other methods how speedup is obtainred refer to Section 2 in Friedman, Hastie, and Tibshirani (2010).

5.3 TODO here

maybe drop into with linear model just put lasso formula directly talk about the stuff that is written already note probelms with correlation of predictors and grouping

5.4 Results

Following we summarize the results of the different lasso approaches. For each approach we show the prediction errors for each test set during the forward validation as well as the actual predictions. The \(\lambda\) that achieved the minimum MSE during the forward validation, will be used to fit the model to the complete training data and evaluated on the hold-out validation set at the end of the observation period. For the full model we show the predictions and resulting coefficient map.

5.4.1 Lasso

Mean squared error of the 5-fold blocked cross validation for a range of lambda values on the log scale. The points in the middle represent the average MSE for the respective lambda, the errorbars give the MSE +/- one standard deviation. The dotted line shows the lambda for which minimum MSE was obtained.

Figure 5.1: Mean squared error of the 5-fold blocked cross validation for a range of lambda values on the log scale. The points in the middle represent the average MSE for the respective lambda, the errorbars give the MSE +/- one standard deviation. The dotted line shows the lambda for which minimum MSE was obtained.

5.1 shows the MSE for all the 5 folds summarizing for each lambda value the mean MSE (black points) and indicates the dispersion of the MSE by one standard deviation by presenting error bars. The minimal average MSE is found at 1.26 with a value of approximately 750. The MSE differ a lot between folds, as seen by the large errors bars. A common approach for choosing \(\lambda\) from cross validation (or here forward validation), is to choose largest lambda that lays within one standard deviation of the \(\lambda_{\min}\) that minimizes the average MSE for the folds. Here we could not apply such a procedure because the error bars have such a wide spread. For the sake of comparability with other lasso models as well with the fused lasso later, we will choose the \(\lambda_{\min}\) for all models when fitting the full model.

MSE of the CV for the different lambda values on the a log scale. The red dotted line shows the lambda for which minimum MSE was obtained.

Figure 5.2: MSE of the CV for the different lambda values on the a log scale. The red dotted line shows the lambda for which minimum MSE was obtained.

In 5.2 we display the prediction errors in each fold when predicting on the test sets. The optimal amount for folds 1 until 4 is are very close, in fold 5 a larger value is chosen. Folds 1 and 2 have a local maximum in the beginning of the path, fold 5 has a local maximum and minimum after the optimal \(\lambda_{\min}\) value. Overall the average regularization seems to be a good compromise here.

Precipitation prediction and target values in the test set in each fold. Predictions in red and target values in black.

Figure 5.3: Precipitation prediction and target values in the test set in each fold. Predictions in red and target values in black.

We now inspect the actual predictions made in the test sets and plot them together with the respective target values (5.3). The precipitation in the test sets are predicted well, in general. But it misses maximum vales and local minimal. Overall the standard lasso shows here it can learn some meaningful connection.

Precipitation prediction and target values in the validation set. Predictions in red and target values in black. The model was fitted on the full CV data with the lambda value that minimised the average MSE

Figure 5.4: Precipitation prediction and target values in the validation set. Predictions in red and target values in black. The model was fitted on the full CV data with the lambda value that minimised the average MSE

## [1] 1314.929

The predictions from the full model on the evaluation set indicate that the model is capable of predicting the seasonal variation in the precipitation data (5.4). But it constantly fails to predict the peaks during the seasonal cycles.

Coefficient plot of the full lasso model with fitted intercept of 434.47

Figure 5.5: Coefficient plot of the full lasso model with fitted intercept of 434.47

Finally we plot the nonzero coefficients from the lasso directly on their locations on the map (5.5. As we can see from the coefficient value legend, the range of coefficient values is tilted towards negative values. Also maybe surprisingly, locations far away from the CAB are included in the model. We can observe negative and positive coefficients in the Atlantic ocean. Some but not of these regions can also be seen to be highly correlated from analysis done on the original and deseasonalised data, see @ref{correlation-chapter}.

5.4.2 standardized lasso

Mean squared error of the 5-fold blocked cross validation for a range of lambda values on the log scale. The points in the middle represent the average MSE for the respective lambda, the errorbars give the MSE +/- one standard deviation. The dotted line shows the lambda for which minimum MSE was obtained.

Figure 5.6: Mean squared error of the 5-fold blocked cross validation for a range of lambda values on the log scale. The points in the middle represent the average MSE for the respective lambda, the errorbars give the MSE +/- one standard deviation. The dotted line shows the lambda for which minimum MSE was obtained.

The learning curve in 5.6 shows some extreme behavior for \(\lambda\) values around 0 on the logarithmic scale. For lower values the standard deviation of the MSE get extremely large. The minimum MSE is 917.28 Inspecting the MSE lines separately for each fold gives more insight.

MSE of the CV for the different lambda values on the a log scale. The red dotted line shows the lambda for which minimum MSE was obtained.

Figure 5.7: MSE of the CV for the different lambda values on the a log scale. The red dotted line shows the lambda for which minimum MSE was obtained.

The folds that drive the large MSE standard deviation are fold 4 and 5 (see 5.7). The range of the best \(\lambda\) value chosen is larger than for the lasso without standardization. Fold 4 chooses very low regularization as optimal.

Precipitation prediction and target values in the test set in each fold. Predictions in red and target values in black.

Figure 5.8: Precipitation prediction and target values in the test set in each fold. Predictions in red and target values in black.

The predictions on the test set are similar to those from the normal lasso, in fold 4 the peak is predicted exactly (??). The Model from fold 2 underestimates more than the respective fold model from the lasso without standardization (see ??).

## [1] 1214.489

The predictions from the full model of the standardized lasso show the same behavior as the ones found in the lasso without standardization. Again seasonal patterns are predicted well but largest values are underestimated (@ref:coef-plot-full-lasso-stand).

Coefficient plot of the full lasso model with fitted intercept of 208.87

Figure 5.9: Coefficient plot of the full lasso model with fitted intercept of 208.87

@ref:coef-plot-full-lasso-stand shows now the coefficient plot for the standardized lasso, the coefficient values are given on the standardized scale. The range of coefficient values is more symmetric around the 0 than for first lasso model and the locations differ as well. For example the two locations at pacific coast of South America are not included in the model here (compare @ref:coef-plot-full-lasso-og).

5.4.3 deseas lasso

Mean squared error of the 5-fold blocked cross validation for a range of lambda values on the log scale. The points in the middle represent the average MSE for the respective lambda, the errorbars give the MSE +/- one standard deviation. The dotted line shows the lambda for which minimum MSE was obtained.

Figure 5.10: Mean squared error of the 5-fold blocked cross validation for a range of lambda values on the log scale. The points in the middle represent the average MSE for the respective lambda, the errorbars give the MSE +/- one standard deviation. The dotted line shows the lambda for which minimum MSE was obtained.

Initially with decreasing regularization the MSE decreases as well before it reaches its minimum and stays fairly constant afterwards (@ref:(err-bar-plot-lasso-deseas)). The minimum average MSE achieved is higher than in the lasso with and without standardization (minimum average MSE here 2381.37). In the decreasing area from the start of the path until its minimum, the size of the error bar increases. We inspect this further in the error plots for each fold.

MSE of the CV for the different lambda values on the a log scale. The red dotted line shows the lambda for which minimum MSE was obtained.

Figure 5.11: MSE of the CV for the different lambda values on the a log scale. The red dotted line shows the lambda for which minimum MSE was obtained.

Fold 4 causes the errorbars to increase in range, since it has a local maximum around l\(log(\lambda)\)) 1.25. Fold 5 shows a similar behavior but the size of the local maximum is smaller relatively to the other MSE values for this fold. Apart from that folds 1,2,3,5 show similar trajectories although the range of their MSE values differ (5.11).

Precipitation prediction and target values in the test set in each fold. Predictions in red and target values in black.

Figure 5.12: Precipitation prediction and target values in the test set in each fold. Predictions in red and target values in black.

In 5.12 we observe that the predictions are quite different from the other models so far (@ref{lasso-og}, @ref{lasso-stand}). Folds 1 and 2 are not predicted accurately and in general the predictions are less smooth and don’t work as well as in the other two models, which is also reflected in the higher average MSE minimum in 5.10

## [1] 1809.455

The final predictions appear to be less smooth than in the other models so far (lasso with and without standardization), but generally as well underestimate the higher values in the data and overestimate low values around month 415 (??). We see somehow better predictions for the peak in the first month of the validation period. The amplitude in the seasonal patterns of the predictions decrease faster than in 5.4 and ??. Note that in our forward validation setting, the seasonality used to de-seasonalize the validation data is the last year of the training set. While during the forward validation phase, when we search for the optimal \(\lambda\) to refit the full model, the test sets are smaller hence relate to shorter time periods. For the validation data we actually have 5 years that are de-seasonalized by the year preceding the validation time period. This estimation of seasonality might not be acccurate any more if the seasonality in later years of the validation data is not the same as in the last year from the forward validation data. This might be an explanation why predicting the seasonal precipitation patterns in the later years become less reliable.

Coefficient plot of the full lasso model with fitted intercept of 437.97

Figure 5.13: Coefficient plot of the full lasso model with fitted intercept of 437.97

The model using de-seasonalized SST data chooses many regions in the highest latitudes and few regions around the equator (@ref(fig:5.4)). Many coefficients of opposite signs can appear quite close to another, espicially in the higher latitudes.

5.4.4 diff1 lasso

Mean squared error of the 5-fold blocked cross validation for a range of lambda values on the log scale. The points in the middle represent the average MSE for the respective lambda, the errorbars give the MSE +/- one standard deviation. The dotted line shows the lambda for which minimum MSE was obtained.

Figure 5.14: Mean squared error of the 5-fold blocked cross validation for a range of lambda values on the log scale. The points in the middle represent the average MSE for the respective lambda, the errorbars give the MSE +/- one standard deviation. The dotted line shows the lambda for which minimum MSE was obtained.

The MSE learning curve in 5.14 has its minimum at \(\lambda=\) 2.21. And shows increasing standard deviation for the MSE values across folds when \(\lambda\) decreases to the end of the path.

MSE of the CV for the different lambda values on the a log scale. The red dotted line shows the lambda for which minimum MSE was obtained.

Figure 5.15: MSE of the CV for the different lambda values on the a log scale. The red dotted line shows the lambda for which minimum MSE was obtained.

The range of MSE inside the different folds vary but overall the folds show similar trajectories and the optimal regularization chosen occurs at close values (on the log scale, 5.2.

Precipitation prediction and target values in the test set in each fold. Predictions in red and target values in black.

Figure 5.16: Precipitation prediction and target values in the test set in each fold. Predictions in red and target values in black.

Here the prediction inside the folds show similar trajectories to 5.3 or 5.8, note that the prediction period is one month shorter because after differentiating the predictors we shorten the target variable by one month, too (5.16).

As for the other models we looked at before, the amplitudes are underestimated in 5.16. But the amplitudes are not decreasing (which can be observed for the other models and most prominently for 5.12). In the peak around month 420, the predictions actually spike to their maximum value, but at the same time they strongly overestimate the period of low precipitation directly preceding.

Coefficient plot of the full lasso model with fitted intercept of 208.95

Figure 5.17: Coefficient plot of the full lasso model with fitted intercept of 208.95

Investigating the coefficient plot for this model (@(fig:coef-plot-full-lasso-diff1)), we find that here a lot of coefficients have large positive values, for example below the equator in the Pacific and Atlantic.

5.4.5 Lasso on clustered precipitation

Recall that our cluster analysis proposes 5 clusters for the \(k\)-means algorithm. We came to this conclusion after applying a PCA and computing the gap statistic for different values of \(k\). We apply the lasso now on each cluster and show for each cluster the learning curve, predictions and coefficient plot of the full model.

5.4.5.1 Cluster 1

Model: Lasso on cluster 1. Mean squared error of the 5-fold blocked cross validation for a range of lambda values on the log scale. The points in the middle represent the average MSE for the respective lambda, the errorbars give the MSE +/- one standard deviation. The dotted line shows the lambda for which minimum MSE was obtained.

Figure 5.18: Model: Lasso on cluster 1. Mean squared error of the 5-fold blocked cross validation for a range of lambda values on the log scale. The points in the middle represent the average MSE for the respective lambda, the errorbars give the MSE +/- one standard deviation. The dotted line shows the lambda for which minimum MSE was obtained.

5.18 shows the MSE for the 5 folds when we fit the lasso on cluster 1 only. The learning curve has no steep increases or declines but increases steadily for less regularization after the minimum at 1.88. The minimum mean MSE is 1616.5936087.

Model: Lasso on cluster 1. Precipitation prediction and target values in the validation set. Predictions in red and target values in black. The model was fitted on the full CV data with the lambda value that minimised the average MSE

Figure 5.19: Model: Lasso on cluster 1. Precipitation prediction and target values in the validation set. Predictions in red and target values in black. The model was fitted on the full CV data with the lambda value that minimised the average MSE

As we can see in the predictions on the evaluation set for cluster 1 (5.19), the model here as well can not predict the peaks in precipitations but catches well the lower values and the general seasonal trajectories.

Model: Lasso on cluster 1. Coefficient plot of the full model. Fitted intercept of 337.28

Figure 5.20: Model: Lasso on cluster 1. Coefficient plot of the full model. Fitted intercept of 337.28

The nonzero coefficients chosen by the lasso are all negative (5.20), since the intercept of the model is relatively high (337.28), the coefficients only decrease the predictions and no predicted value is higher than the intercept.

5.4.5.2 Cluster 2

Model: Lasso on cluster 2. Mean squared error of the 5-fold blocked cross validation for a range of lambda values on the log scale. The points in the middle represent the average MSE for the respective lambda, the errorbars give the MSE +/- one standard deviation. The dotted line shows the lambda for which minimum MSE was obtained.

Figure 5.21: Model: Lasso on cluster 2. Mean squared error of the 5-fold blocked cross validation for a range of lambda values on the log scale. The points in the middle represent the average MSE for the respective lambda, the errorbars give the MSE +/- one standard deviation. The dotted line shows the lambda for which minimum MSE was obtained.

5.21 has more narrow error bars than the same plot for cluster 1. It has optimum at 1.3529678 and a minimum mean MSE is 1250.7993903. And regularization chosen here is lower and the resulting MSE as well.

Model: Lasso on cluster 2. Precipitation prediction and target values in the validation set. Predictions in red and target values in black. The model was fitted on the full CV data with the lambda value that minimised the average MSE

Figure 5.22: Model: Lasso on cluster 2. Precipitation prediction and target values in the validation set. Predictions in red and target values in black. The model was fitted on the full CV data with the lambda value that minimised the average MSE

The final predictions on cluster 2 catch the seasonality as well as high and low values reasonably well (5.22). But it still misses higher values in the 4th and 5th peak

Model: Lasso on cluster 2. Coefficient plot of the full model. Fitted intercept of 481.02

Figure 5.23: Model: Lasso on cluster 2. Coefficient plot of the full model. Fitted intercept of 481.02

The lower \(\lambda\) chosen in cluster 2 results in a higher number of non-zero coefficients than for example in cluster 1 (5.23). Here the intercept is also higher (481.02) than the predicted values. So most of the coefficient values are negative.

5.4.5.3 Cluster 3

Model: Lasso on cluster 3. Mean squared error of the 5-fold blocked cross validation for a range of lambda values on the log scale. The points in the middle represent the average MSE for the respective lambda, the errorbars give the MSE +/- one standard deviation. The dotted line shows the lambda for which minimum MSE was obtained.

Figure 5.24: Model: Lasso on cluster 3. Mean squared error of the 5-fold blocked cross validation for a range of lambda values on the log scale. The points in the middle represent the average MSE for the respective lambda, the errorbars give the MSE +/- one standard deviation. The dotted line shows the lambda for which minimum MSE was obtained.

5.24 has its optimum at 1.6655202 and a minimum mean MSE is 3091.4484301. The error bars change in size and increase for lower regularization values after the minimum MSE is reached.

Model: Lasso on cluster 3. Precipitation prediction and target values in the validation set. Predictions in red and target values in black. The model was fitted on the full CV data with the lambda value that minimised the average MSE

Figure 5.25: Model: Lasso on cluster 3. Precipitation prediction and target values in the validation set. Predictions in red and target values in black. The model was fitted on the full CV data with the lambda value that minimised the average MSE

The first two seasons of precipitation are predicted well for cluster 3(5.25). But as the values increase in later seasons again the model can not predict the peaks in the precipitation data.

Model: Lasso on cluster 3. Coefficient plot of the full model. Fitted intercept of 241.3

Figure 5.26: Model: Lasso on cluster 3. Coefficient plot of the full model. Fitted intercept of 241.3

Again here the lower \(\lambda\) value leads to a higher number of coefficients chosen (@(fig:cl3-coef-plot)). Also the intercept this time is lower (241.3), there are also large positive coefficients included in the model and predictions are higher than the intercept, too.

5.4.5.4 Cluster 4

Model: Lasso on cluster 4. Mean squared error of the 5-fold blocked cross validation for a range of lambda values on the log scale. The points in the middle represent the average MSE for the respective lambda, the errorbars give the MSE +/- one standard deviation. The dotted line shows the lambda for which minimum MSE was obtained.

Figure 5.27: Model: Lasso on cluster 4. Mean squared error of the 5-fold blocked cross validation for a range of lambda values on the log scale. The points in the middle represent the average MSE for the respective lambda, the errorbars give the MSE +/- one standard deviation. The dotted line shows the lambda for which minimum MSE was obtained.

5.27 indicates are higher optimal \(\lambda\) and the lowest mean MSE is also higher than in the preceeding clusters.

Model: Lasso on cluster 4. Precipitation prediction and target values in the validation set. Predictions in red and target values in black. The model was fitted on the full CV data with the lambda value that minimised the average MSE

Figure 5.28: Model: Lasso on cluster 4. Precipitation prediction and target values in the validation set. Predictions in red and target values in black. The model was fitted on the full CV data with the lambda value that minimised the average MSE

For cluster 4 we see that the precipitation values in the evaluation set are a lot more “wiggly” than in the other cluster. The higher regularization chosen from the forward validation does not allow the model to predict these fast changing curves well. Again seasonality is predicted reasonably (5.28).

Model: Lasso on cluster 4. Coefficient plot of the full model. Fitted intercept of 147.67

Figure 5.29: Model: Lasso on cluster 4. Coefficient plot of the full model. Fitted intercept of 147.67

The nonzero coefficients for cluster 4 have quite large positive and negative values. The intercept is 147.67(5.29). No locations next to the south-american coast line is chosen.

5.4.5.5 Cluster 5

Model: Lasso on cluster 5. Mean squared error of the 5-fold blocked cross validation for a range of lambda values on the log scale. The points in the middle represent the average MSE for the respective lambda, the errorbars give the MSE +/- one standard deviation. The dotted line shows the lambda for which minimum MSE was obtained.

Figure 5.30: Model: Lasso on cluster 5. Mean squared error of the 5-fold blocked cross validation for a range of lambda values on the log scale. The points in the middle represent the average MSE for the respective lambda, the errorbars give the MSE +/- one standard deviation. The dotted line shows the lambda for which minimum MSE was obtained.

5.30 indicates again chooses a higher optimal \(\lambda\). The range of the error bars increases with decreasing regularization

Model: Lasso on cluster 5. Precipitation prediction and target values in the validation set. Predictions in red and target values in black. The model was fitted on the full CV data with the lambda value that minimised the average MSE

Figure 5.31: Model: Lasso on cluster 5. Precipitation prediction and target values in the validation set. Predictions in red and target values in black. The model was fitted on the full CV data with the lambda value that minimised the average MSE

For cluster 5 we also notice that the precipitation values in the evaluation set are very “wiggly”. The higher regularization chosen from the forward validation again does not allow the model to predict these fast changing curves well. Here too seasonality is predicted reasonably (5.31).

Model: Lasso on cluster 5. Coefficient plot of the full model. Fitted intercept of 370.4

Figure 5.32: Model: Lasso on cluster 5. Coefficient plot of the full model. Fitted intercept of 370.4

A very large negative coefficient value is chosen for . The intercept is 370.4(5.32). No locations next to the south-american coast line are chosen.

5.4.5.6 Cluster Summary

Overall we did not improve MSE when fitting the lasso on each cluster and evaluated on the set, but in cluster 2. It seems that when the data has a lot of nonseasonal peaks and valleys, the regularization chosen from the forward validation only suffices for predicting the seasonality. In cluster 2 where the precipitation time series in the validation set is not very “wiggly” (see for example cluster 4 we predict the precipitation data quite well. The cluster analysis helped us finding the homogeneous cluster 2 which is also the largest cluster, recall 4.4.

Table ?? summarize the MSE for each cluster.

5.5 Lasso summary

In this chapter we compared different settings for the lasso model to predict the precipitation in the validation set. We chose a regularization level based on a 5-fold forward validation, meaning that time ordering was preserved during fitting and testing. The \(\lambda\) that minimized the average MSE across all folds was selected and the model was refitted with this value on all training data. This full model was then used to predict precipitation on the validation set that was not used during the training phase. No observation was used for model fitting and testing, meaning there were no overlaps of fold time periods. We inspected the results of the lasso with and without standardization, after de-seasonalizing the SST data according to our validation scheme and after differentiating the SST data. We also used a cluster analysis to separate the CAB in 5 clusters and fit the lasso procedure on each cluster (see @ref{cluster-methods} for details). If we solely compare MSE we find that standardizing the SST data improves the predictions results over the other lasso models. De-seasonalizing the predictors prior to model fitting did not yield competitive results, this might be due to the differences in SST seasonality between the data used for forward validation (estimating optimal \(\lambda\)) and model evaluation. When differentiating the SST data we find that the obtained learning curves show similar trajectories and the dispersion at the optimal level of regularization is comparably small. When fitting the lasso on each of the clusters we found in @ref{clustering-analyze-results.} we don’t improve predictions on most clusters. But for one cluster (cluster 2, @ref{cl2}) the lasso can actually predict the values in the validation set reasonably well since the precipitation values in this area don’t disperse a lot from the seasonal signals. The regions that were included in the model varied greatly across models in general, which can also be explained by the different levels of regularization chosen. But some regions were included more often than others. Areas in the west Atlantic around the equator were included in all models but at different magnitudes. Also around the equator but in the east pacific coastal regions were included in almost all models but in the standardized lasso (see 5.9). While we would expect these regions to be predictive, based on the paper by Ciemer et al. (2020) and our correlation analysis @ref{correlation-chapter}. There also locations far away from the CAB that were included in all models. See for example that all models included a region in the Caspian sea. But in the de-seasonalized model this region has a very low coefficient value so for some larger levels of regularization it would probably not be included. Finally, the lasso can predict the seasonal patterns well, but when using the precipitation averaged over the CAB, it fails to predict the higher values of seasonal amplitudes.

6 The fused lasso

6.1 Introduction

As expected and seen in the results, the different LASSO models choose single SST regions as predictors as opposed to whole regions. When predictors are highly correlated the lasso tends to choose only one of the variables and discard the others. Also the LASSO only regularizes the magnitude of coefficients but ignores their ordering.
We therefore use the so-called fused lasso as implemented in the genlasso package and the respective fusedlasso function. (R. Tibshirani et al. (2005), Taylor B. Arnold and Tibshirani (2020)) The fused lasso is a generalization of the lasso for problems with features that can be ordered in a meaningful way. It penalizes not only the coefficients’ \(L_1\)-norm but also their differences given their ordering, introducing sparsity in both of them. In our case the fused lasso thus penalizes the differences of SST coefficients that are close to each other.

The fused LASSO as implemented in genlasso package solves the problem:

\[\begin{equation} \min_{\beta} 1/2 \sum_{i=1}^n(y_i - x_i^T\beta_i)^2 + \lambda \sum_{i,j \in E} |\beta_i - \beta_j| + \gamma \cdot \lambda \sum_{i=1}^p|\beta_i|, \tag{6.1} \end{equation}\]

with \(x_i\) being the ith row of the predictor matrix and E is the edge set of an underlying graph. Regularizing \(|\beta_i-\beta_j|\), penalizes large differences in close coefficients. In our case “close” means small distances as defined on 2-dimensional longitude/latitude grid. This grid defines a graph that can be used to compute the distances for each location. The third term \(\gamma \cdot \lambda \sum_{i=1}^p|\beta_i|\), controls the sparsity of the coefficients. \(\gamma=0\) leads to complete fusion of the coefficients (no sparsity) and \(\gamma\) > 0 introduces sparsity to the solution, with higher values placing more priority on sparsity. \(\hat{\beta}\) is computed as a function of \(\lambda\), with fixed \(\gamma\).

6.2 Implementation

The summary of the algorithm is taken from the paper proposing the implementation, Taylor B. Arnold and Tibshirani (2016) and the original paper introducing the algorithm R. J. Tibshirani and Taylor (2011). In the fused lasso setting the coefficients \(\beta \in \mathbb{R}^p\) can be thought of as nodes of a given undirected Graph \(G\), with edge set \(E \subset {1,...,p}^2\). Now lets assume that \(E\) has \(m\) edges which are enumerated \(e_1,...,e_m\). The fused lasso penalty matrix \(D\) is then \(m \times p\), where each row corresponds to an edge in \(E\). So when \(e_l = (i,j)\), we write \(l_{th}\) row of \(D\) as

\[\begin{equation} D_l = (0,...,-1,...,1,...) \in \mathbb{R}^p, \tag{6.2} \end{equation}\]

meaning \(D_l\) has all zeros except for the the \(i_{th}\) and \(j_{th}\) location.

(6.1) is solved by a dual path algorithm that was proposed by Taylor B. Arnold and Tibshirani (2016) for different use cases of the (sparse) fused lasso. They describe the dual path algorithm based on the notation of the generalized lasso problem R. J. Tibshirani and Taylor (2011):

\[\begin{equation} \hat{\beta}=\underset{\beta \in \mathbb{R}^{p}}{\arg \min } \frac{1}{2}\|y-X \beta\|_{2}^{2}+\lambda\|D \beta\|_{1}, \tag{6.3} \end{equation}\]

where \(y \in \mathbb{R}^n\) is the vector of the outcome, \(X \in \mathbb{R}^{n \times p}\) a predictor matrix, \(D \in \mathbb{R}^{m \times p}\) denotes a penalty matrix, and \(\lambda \geq 0\) is a regularization parameter. The dual path algorithm solves not the primal but the dual solution of the problem and computes the solution for a whole path instead of single values of \(\lambda\). Hence the “dual” and “path” that make up the name. Taylor B. Arnold and Tibshirani (2016) argue that the strength of the original algorithm lays in the fact that it applies to a unified framework in which \(D\) can be a general penalty matrix. Let’s consider the case when \(X=I\) and \(rank(X)=p\) (this is called the “signal approximator” case), the dual problem of (6.3) is then:

\[\begin{equation} \hat{u} \in \underset{\substack{u \in \mathbb{R}^{\omega}}}{\arg \min } \frac{1}{2}\left\|y-D^{T} u\right\|_{\frac{2}{2}} \text { subject to }\|u\|_{\infty} \leq \lambda. \tag{6.4} \end{equation}\]

The primal and dual solutions, \(\hat{\beta}\) and \(\hat{u}\) are related by:

\[\begin{equation} \hat{\beta}=y-D^{T} \hat{u} . \tag{6.5} \end{equation}\]

While the primal solution is unique, this does not need to be the case for the dual solution (note the element notation in (6.4). The dual path algorithm starts at \(\lambda = \infty\) and computes the path until \(\lambda = 0\). Conceptually the algorithm keeps track of the coordinates of the dual solutions it computed for each lambda \(\hat{u}(\lambda)\). The solutions are equal to \(\pm\lambda\), meaning they lie on the boundary of the region \([-\lambda,\lambda]\). Along the path it computes the critical values of \(\lambda\), \(\lambda_1 \geq \lambda_2,...,\) at which the coordinates of these solutions hit or leave the boundary.

There are two algorithms described in the paper and the various specialized implementations that can increase efficiency depending on the use cases. This depends on \(X\), and/or the special structure of \(D\). Algorithm 1 handles the \(X=I\) case and Algorithm 2 the general \(X\) case. As we introduced the dual in (6.4), it assumed \(X=I\), which is not satisfied in our case. For the general \(X\) case the problem formulation can be rewritten so that the formula only changes \(D\) and \(y\) to \(\tilde{D}\) and \(\tilde{y}\) and then the same algorithm can be applied. \(\tilde{D}=DX^+\) and \(\tilde{y}=XX^+y\), where \(X^x\) is the Moore-Penrose pseudoinverse of \(X \in \mathbb{R}^{n \times p}\). Algorithm 2 therefore transforms \(X\) and \(y\) in a certain way and then applies Algorithm 1 to the transformed problem. It’s also easy to see that in our case \(p > n\) and \(X\) is column rank deficient. They solve this by adding a small fixed \(l_2\) penalty to the original problem,which leads to:

\[\begin{equation} \operatorname{minimize}_{\beta \in \mathbb{R}^{p}} \frac{1}{2}\|y-X \beta\|_{2}^{2}+\lambda\|D \beta\|_{1}+\varepsilon\|\beta\|_{2}^{2}, \tag{6.6} \end{equation}\]

and this is the same as

\[\begin{equation} \underset{\beta}{\operatorname{minimize}} \frac{1}{2}\left\|y^{*}-\left(X^{*}\right) \beta\right\|_{2}^{2}+\lambda\|D \beta\|_{1}, \tag{6.7} \end{equation}\]

with \(y^{*}=(y, 0)^{T}\) and \(X^{*}=\left[\begin{array}{c}x \\ \varepsilon \cdot I\end{array}\right]\). Because \(\operatorname{rank}\left(X^{*}\right)=p\), again it is possible to apply one of the algorithms. Instead of solving linear systems in each, we can apply a QR decomposition that can be updated in a neat way to avoid solving the complete linear system in each step. See the appendix of Taylor B. Arnold and Tibshirani (2016), for details. Special care has to be taken though for general X and certain \(D\). “Blindly” applying the algorithms then would lead to a large drop in relative efficiency. Since \(\tilde{D}= DX^+\) destroys the special structures that are present in the penalty matrix, special implementations are constructed including our case with general \(X\) and \(D\) coming from the sparse fused lasso.

It can be shown that the computational costly steps in the algorithm reduce to solving linear systems of the form \(DD^T = Dc\). When \(D\) is the oriented incidence matrix of a graph, \(D\) will be sparse but can also be rank deficient when the graph has more edges than nodes, hence \(m > p\). For sparse undetermined systems it is possible to find an arbitrary solution (here called the basic solution), but computing the solution with minimum \(l_2\) norm is a lot more difficult in general Van Loan and Golub (1996) . It is possible though to derive the minimum \(l_2\) norm solution from the basic solution. In the case of penalty matrices that come from a graph the structure of \(D\) can be used to improve efficiency when When \(D\) is the incidence matrix of a graph, then \(D^TD\) is the Laplacian matrix of G. The Laplacian linear systems are then solved using a sparse Cholesky decomposition. For further details of the steps used in our case refer to Section 4 and 5 in Taylor B. Arnold and Tibshirani (2016).

7 References

Arnold, Taylor B., and Ryan J. Tibshirani. 2020. Genlasso: Path Algorithm for Generalized Lasso Problems. https://CRAN.R-project.org/package=genlasso.
Arnold, Taylor B, and Ryan J Tibshirani. 2016. “Efficient Implementations of the Generalized Lasso Dual Path Algorithm.” Journal of Computational and Graphical Statistics 25 (1): 1–27.
Ciemer, Catrin, Lars Rehm, Juergen Kurths, Reik V Donner, Ricarda Winkelmann, and Niklas Boers. 2020. “An Early-Warning Indicator for Amazon Droughts Exclusively Based on Tropical Atlantic Sea Surface Temperatures.” Environmental Research Letters 15 (9): 094087.
Climate Impact Research (PIK) e. V., Potsdam Institute for. 2022. “Potsdam Insitute for Climate Impact Research.” 2022. https://www.pik-potsdam.de/en.
Fahrmeir, Ludwig, Wolfgang Brachinger, Alfred Hamerle, and Gerhard Tutz. 1996. Multivariate Statistische Verfahren. Walter de Gruyter.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22. https://doi.org/10.18637/jss.v033.i01.
Funk, Chris, Pete Peterson, Martin Landsfeld, Diego Pedreros, James Verdin, Shraddhanand Shukla, Gregory Husak, et al. 2015. “The Climate Hazards Infrared Precipitation with Stations-a New Environmental Record for Monitoring Extremes.” Scientific Data 2 (1): 1–21.
Hastie, Trevor, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Vol. 2. Springer.
Huang, Boyin, Peter W Thorne, Viva F Banzon, Tim Boyer, Gennady Chepurin, Jay H Lawrimore, Matthew J Menne, et al. 2017. “NOAA Extended Reconstructed Sea Surface Temperature (ERSST), Version 5.” NOAA National Centers for Environmental Information 30: 8179–8205.
Schulzweida, Uwe. 2019. “CDO User Guide (Version 1.9. 6).” Max Planck Institute for Meteorology: Hamburg, Germany.
Smith, Thomas M, Richard W Reynolds, Thomas C Peterson, and Jay Lawrimore. 2008. “Improvements to NOAA’s Historical Merged Land–Ocean Surface Temperature Analysis (1880–2006).” Journal of Climate 21 (10): 2283–96.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B (Methodological) 58 (1): 267–88.
Tibshirani, Robert, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. 2005. “Sparsity and Smoothness via the Fused Lasso.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (1): 91–108.
Tibshirani, Robert, Guenther Walther, and Trevor Hastie. 2001. “Estimating the Number of Clusters in a Data Set via the Gap Statistic.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63 (2): 411–23.
Tibshirani, Ryan J, and Jonathan Taylor. 2011. “The Solution Path of the Generalized Lasso.” The Annals of Statistics 39 (3): 1335–71.
Van der Kooij, Anita J. 2007. Prediction Accuracy and Stability of Regression with Optimal Scaling Transformations. Leiden University.
Van Loan, Charles F, and G Golub. 1996. “Matrix Computations.”